# Data Dictionary

This document describes the variables included in the main dataset files used for analysis.



| Variable Name  | Description | Data Type | Example |
|----------------|-------------|-----------|---------|
| `videoId` | Unique identifier of the YouTube video where the comment was posted | String | `mSBtHTBzg_Y` |
| `channel` | Name of the YouTube channel | String | `WallStreetJournal` |
| `authorID` | Anonymized user ID of the comment author | String / Integer | `7729` |
| `commentID` | Anonymized unique identifier for the comment | String / Integer | `6669` |
| `publishedAt` | Date and time when the comment was published | DateTime (MM/DD/YY HH:MM) | `10/31/24 14:01` |
| `isReply` | Indicates whether the comment is a reply to another comment | Boolean (`TRUE`/`FALSE`) | `FALSE` |
| `cleaned_comment` | Preprocessed text of the comment (after cleaning) | String | `pennsylvania has already become ground zero for electionfraud claims` |
| `comment_length` | Length of the cleaned comment in words (tokens split on whitespace) | Integer | `9` |
| `umap_embedding` | Vector representation of the comment obtained from UMAP dimensionality reduction | List of Floats | `[7.306291580200195, 7.151223182678223, 6.620931148529053, 12.594696044921875, 3.280472755432129]` |
| `cluster_topic` | Assigned topic category for the comment | Categorical | `Democracy` |
| `date` | Extracted date portion of `publishedAt` | Date (MM/DD/YY) | `11/1/24` |
| `hour` | Hour of day (from `publishedAt`) | Integer (0–23) | `14` |
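
The types in the table can be applied when reading a row. The following is a minimal sketch using only the Python standard library. It assumes the file is a CSV in which every cell arrives as a string, `umap_embedding` is stored as a bracketed list literal, and booleans are the strings `TRUE`/`FALSE`; none of this is confirmed by the dataset itself. The function name `parse_row` is hypothetical, and the example row simply reuses the example values from the table.

```python
import ast
from datetime import datetime

def parse_row(row):
    """Convert one raw CSV row (all values are strings) to typed values."""
    return {
        "videoId": row["videoId"],
        "channel": row["channel"],
        "authorID": row["authorID"],    # left as a string; the table allows String / Integer
        "commentID": row["commentID"],  # left as a string; the table allows String / Integer
        "publishedAt": datetime.strptime(row["publishedAt"], "%m/%d/%y %H:%M"),
        "isReply": row["isReply"].strip().upper() == "TRUE",
        "cleaned_comment": row["cleaned_comment"],
        "comment_length": int(row["comment_length"]),
        # Assumption: the embedding is serialized as a Python-style list literal.
        "umap_embedding": [float(x) for x in ast.literal_eval(row["umap_embedding"])],
        "cluster_topic": row["cluster_topic"],
        "date": datetime.strptime(row["date"], "%m/%d/%y").date(),
        "hour": int(row["hour"]),
    }

# Example values copied from the table above (they may come from different rows).
example_row = {
    "videoId": "mSBtHTBzg_Y",
    "channel": "WallStreetJournal",
    "authorID": "7729",
    "commentID": "6669",
    "publishedAt": "10/31/24 14:01",
    "isReply": "FALSE",
    "cleaned_comment": "pennsylvania has already become ground zero for electionfraud claims",
    "comment_length": "9",
    "umap_embedding": "[7.306291580200195, 7.151223182678223, 6.620931148529053, "
                      "12.594696044921875, 3.280472755432129]",
    "cluster_topic": "Democracy",
    "date": "11/1/24",
    "hour": "14",
}

parsed = parse_row(example_row)
print(parsed["publishedAt"], parsed["isReply"], len(parsed["umap_embedding"]))
```

If the files are loaded with a dataframe library instead, the same format strings and list parsing would apply column-wise.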

## Notes

- The `umap_embedding` field represents 5-dimensional coordinates used for clustering and topic modeling.
- The `cluster_topic` field holds the topic assigned by GPT-4o based on TF-IDF of the `cleaned_comment` column. `relevantcomments_dataset.csv` uses the subset of this column restricted to the 5 main categories: **Immigration, Inflation, Public Health, Identity Politics, Democracy**.
- The anonymized IDs (`authorID`, `commentID`) have been generated to protect user privacy.
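
Restricting rows to the five main categories named in the notes can be sketched as a simple filter on `cluster_topic`. This assumes the labels in the data match the five names exactly (same case and spacing), which this document does not confirm. `MAIN_TOPICS` and `filter_main_topics` are hypothetical names, and the second sample row is invented purely for illustration.

```python
# The five main categories named in the notes.
MAIN_TOPICS = {
    "Immigration",
    "Inflation",
    "Public Health",
    "Identity Politics",
    "Democracy",
}

def filter_main_topics(rows):
    """Keep only rows whose cluster_topic is one of the five main categories."""
    return [row for row in rows if row["cluster_topic"] in MAIN_TOPICS]

# Illustrative rows: the first uses example values from the table,
# the second is made up to show a row that gets dropped.
rows = [
    {"commentID": "6669", "cluster_topic": "Democracy"},
    {"commentID": "0000", "cluster_topic": "Some Other Topic"},
]

relevant = filter_main_topics(rows)
print(len(relevant), "of", len(rows), "rows kept")
```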

